On Ergodic Two-Armed Bandits
Authors
Abstract
A device has two arms with unknown deterministic payoffs, and the aim is to asymptotically identify the better one without spending too much time on the other. The Narendra algorithm offers a stochastic procedure to this end. We show, under weak ergodic assumptions on these deterministic payoffs, that the procedure eventually chooses the best arm (i.e., the one with the greatest Cesàro limit) with probability one, for appropriate step sequences of the algorithm. In the case of i.i.d. payoffs, this implies a "quenched" version of the "annealed" result of Lamberton, Pagès and Tarrès (2004) [6] by the law of the iterated logarithm, thus generalizing it. More precisely, if (η_{ℓ,i})_{i∈ℕ} ∈ {0,1}^ℕ, ℓ ∈ {A, B}, are the deterministic reward sequences we would obtain if we played arm ℓ at time i, we obtain infallibility under the same assumption on nonincreasing step sequences as in [6], replacing the i.i.d. assumption on the payoffs by the hypothesis that the empirical averages ∑_{i=1}^n η_{A,i}/n and ∑_{i=1}^n η_{B,i}/n converge, as n tends to infinity, to θ_A and θ_B respectively, with rate at least 1/(log n)^{1+ε} for some ε > 0. We also show a fallibility result, i.e., convergence with positive probability to the choice of the wrong arm, which implies the corresponding result of [6] in the i.i.d. case.
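To make the procedure concrete, here is a minimal Python sketch of the linear reward-inaction update commonly associated with the Narendra algorithm, run on two deterministic {0,1} payoff sequences. The function name, the step sequence gamma(n) = 1/(n + 10), and the Bernoulli-generated example payoffs are illustrative assumptions, not the paper's exact setting.

```python
import random

def narendra(eta_A, eta_B, gamma, p0=0.5, rng=random.Random(0)):
    """Linear reward-inaction scheme on two deterministic payoff
    sequences eta_A, eta_B taking values in {0, 1}.

    p is the probability of playing arm A.  A reward of 1 reinforces
    the arm just played by the step gamma(n); a reward of 0 leaves p
    unchanged ("inaction").
    """
    p = p0
    for n, (a, b) in enumerate(zip(eta_A, eta_B), start=1):
        step = gamma(n)  # step sequence; nonincreasing in the paper's setting
        if rng.random() < p:   # play arm A, observe its payoff a
            if a == 1:
                p += step * (1.0 - p)  # push p toward 1
        else:                  # play arm B, observe its payoff b
            if b == 1:
                p -= step * p          # push p toward 0
    return p

# Hypothetical payoff sequences whose empirical averages converge to
# theta_A = 0.7 and theta_B = 0.4.
src = random.Random(1)
eta_A = [1 if src.random() < 0.7 else 0 for _ in range(100_000)]
eta_B = [1 if src.random() < 0.4 else 0 for _ in range(100_000)]
print(narendra(eta_A, eta_B, gamma=lambda n: 1.0 / (n + 10)))  # typically close to 1
```

Infallibility in the abstract's sense corresponds to the returned probability converging to 1 almost surely, so that the fraction of pulls wasted on arm B vanishes.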
Similar Resources
Calibrated Fairness in Bandits
We study fairness within the stochastic multi-armed bandit (MAB) decision-making framework. We adapt the fairness framework of "treating similar individuals similarly" [5] to this setting. Here, an 'individual' corresponds to an arm and two arms are 'similar' if they have a similar quality distribution. First, we adopt a smoothness constraint that if two arms have a similar quality distribution ...
Semi-Bandits with Knapsacks
We unify two prominent lines of work on multi-armed bandits: bandits with knapsacks and combinatorial semi-bandits. The former concerns limited “resources” consumed by the algorithm, e.g., limited supply in dynamic pricing. The latter allows a huge number of actions but assumes combinatorial structure and additional feedback to make the problem tractable. We define a common generalization, supp...
A study of Thompson Sampling with Parameter h
The Thompson Sampling algorithm is a well-known Bayesian algorithm for solving the stochastic multi-armed bandit problem. At each time step, the algorithm chooses each arm with probability proportional to that arm being the current best arm. We modify the strategy by introducing a parameter h which alters the importance of the probability of an arm being the current best arm. We show that the optimality of Thompson sa...
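The vanilla mechanism this entry describes can be sketched with a Beta-Bernoulli model: sampling one value from each arm's posterior and playing the argmax selects each arm with exactly the posterior probability of its being the best. This is a minimal illustration of the unmodified strategy, assuming Bernoulli rewards; the parameter h variant studied in the paper is not reproduced here, and the arm means are hypothetical.

```python
import random

def thompson_bernoulli(pulls, true_means, rng=random.Random(0)):
    """Beta-Bernoulli Thompson Sampling for a K-armed bandit.

    Each arm keeps a Beta(successes + 1, failures + 1) posterior; at
    every step one value is sampled per posterior and the argmax is
    played, i.e., each arm is chosen with the posterior probability
    of its being the best.
    """
    k = len(true_means)
    succ = [0] * k
    fail = [0] * k
    total_reward = 0
    for _ in range(pulls):
        samples = [rng.betavariate(succ[i] + 1, fail[i] + 1) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        if rng.random() < true_means[arm]:  # hypothetical Bernoulli reward
            succ[arm] += 1
            total_reward += 1
        else:
            fail[arm] += 1
    return total_reward, succ, fail

print(thompson_bernoulli(10_000, [0.3, 0.5, 0.6]))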
Thompson Sampling for Contextual Bandits with Linear Payoffs
Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the state-of-the-art methods. However, many questions regarding its theoretical performance remained open. In this paper, we d...
Verification Based Solution for Structured MAB Problems
We consider the problem of finding the best arm in a stochastic Multi-armed Bandit (MAB) game and propose a general framework based on verification that applies to multiple well-motivated generalizations of the classic MAB problem. In these generalizations, additional structure is known in advance, causing the task of verifying the optimality of a candidate to be easier than discovering the bes...